
Conversation

@bssrdf
Contributor

@bssrdf bssrdf commented Oct 29, 2025

While working on #15805, I noticed the current cpy_flt kernel has significant uncoalesced global access.

This is particularly bad if one tries to make a transposed tensor contiguous with cur = ggml_cont(ctx, ggml_transpose(ctx, cur));. Some simple benchmarks from test-backend-ops perf -o CPY on a 4090:

Master, using permute to simulate a transpose:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                132 runs -  7642.58 us/run -  1572864 kB/run -  200.73 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                172 runs -  6662.13 us/run -   786432 kB/run -  113.89 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                344 runs -  2966.67 us/run -   786432 kB/run -  255.75 GB/s

Using shared memory is a common way to make global memory accesses coalesced. I implemented another copy kernel and got a 2x-4x boost.
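For reference, the approach follows the classic shared-memory tiled-transpose pattern; here is a minimal sketch of the idea (tile size, names and launch shape are illustrative, not the exact kernel in this PR):

```cuda
// Illustrative sketch of a shared-memory tiled transpose copy (not the exact
// kernel in this PR): both the global load and the global store are coalesced,
// and the actual transposition happens in shared memory.
#define TILE_DIM   32
#define BLOCK_ROWS 8

static __global__ void transpose_copy_f32(const float * __restrict__ src,
                                          float * __restrict__ dst,
                                          const int ne0, const int ne1) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 padding avoids bank conflicts

    int x = blockIdx.x*TILE_DIM + threadIdx.x;      // column in src
    int y = blockIdx.y*TILE_DIM + threadIdx.y;      // row in src

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < ne0 && y + j < ne1) {
            tile[threadIdx.y + j][threadIdx.x] = src[(y + j)*ne0 + x];  // coalesced read
        }
    }
    __syncthreads();

    x = blockIdx.y*TILE_DIM + threadIdx.x;          // column in dst (= row in src)
    y = blockIdx.x*TILE_DIM + threadIdx.y;          // row in dst (= column in src)

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < ne1 && y + j < ne0) {
            dst[(y + j)*ne1 + x] = tile[threadIdx.x][threadIdx.y + j];  // coalesced write
        }
    }
}
// launched e.g. with dim3 grid((ne0+TILE_DIM-1)/TILE_DIM, (ne1+TILE_DIM-1)/TILE_DIM),
//                    dim3 block(TILE_DIM, BLOCK_ROWS)
```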
This PR, with the src tensor transposed:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                484 runs -  2119.14 us/run -  1572864 kB/run -  723.92 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                817 runs -  1276.55 us/run -   786432 kB/run -  594.35 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):               1118 runs -   917.03 us/run -   786432 kB/run -  827.37 GB/s

Currently I use the src tensor's OP == TRANSPOSE to let ggml_cpy_flt_cuda pick the customized transposed copy. I am not sure if there is a better way to make this choice. Your suggestions are welcome.
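For illustration only (this is not the check the PR uses), a stride-based test would also catch a permute that is effectively a transpose:

```cuda
// Hypothetical alternative to checking src->op == GGML_OP_TRANSPOSE: inspect the
// strides instead. A view created by ggml_transpose has its innermost stride
// larger than the next one, which is the same condition ggml_is_transposed() tests.
static bool is_effectively_transposed(const ggml_tensor * src) {
    return src->nb[0] > src->nb[1];
}
```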

…void uncoalesced access; test cases also added showing improved memory bandwidth
@bssrdf bssrdf requested a review from slaren as a code owner October 29, 2025 12:54
@CISC
Collaborator

CISC commented Oct 29, 2025

Any reason for not including BF16 as well?

@bssrdf
Contributor Author

bssrdf commented Oct 29, 2025

Any reason for not including BF16 as well?

Just added support for BF16. Other quantized types may also have this problem, and the fix is not as straightforward as it is for FP32, etc.

@CISC
Collaborator

CISC commented Oct 29, 2025

Other quantized types may also have this problem, and the fix is not as straightforward as it is for FP32, etc.

Probably only relevant for quantized KV cache (perhaps not even then, unsure), so not a big issue.

@github-actions github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Oct 29, 2025
@am17an
Collaborator

am17an commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

@bssrdf
Contributor Author

bssrdf commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

Yeah, something is wrong. Using the OP equality to make a decision is likely not robust. Will look into it. Thanks.

Edit: I ran ci/run.sh locally on my machine (WSL2 Ubuntu). The rerank test passed.

rerank score 0:    0.171
rerank score 1:    0.169
rerank score 2:    0.189

0.00.608.024 I llama_perf_context_print:        load time =     321.37 ms
0.00.608.025 I llama_perf_context_print: prompt eval time =     105.71 ms /    62 tokens (    1.70 ms per token,   586.52 tokens per second)
0.00.608.026 I llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
0.00.608.027 I llama_perf_context_print:       total time =     106.26 ms /    63 tokens
0.00.608.027 I llama_perf_context_print:    graphs reused =          0

real    0m0.864s
user    0m0.399s
sys     0m0.355s
  - rerank score 0 @ 0.171 OK
  - rerank score 1 @ 0.169 OK
  - rerank score 2 @ 0.189 OK

I am not sure what's going on. The 3 scores are the same, but it passed on my machine and failed in CI.

@am17an
Collaborator

am17an commented Oct 30, 2025

Try running it through compute-sanitizer
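For example, something along these lines (the binary path and arguments here are just an illustration):

```
compute-sanitizer --tool memcheck ./bin/test-backend-ops test -o CPY
```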

Collaborator

@JohannesGaessler JohannesGaessler left a comment

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.

@bssrdf
Contributor Author

bssrdf commented Oct 30, 2025

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.

@JohannesGaessler, thanks for the comments. I realized the logic for triggering the transposed copy is not right. I am working on a more general way of copying. The bottom line is to avoid uncoalesced access as much as possible, since it really hurts bandwidth.

@JohannesGaessler
Collaborator

Is this PR in a state where you want to move towards merging it?

@bssrdf
Contributor Author

bssrdf commented Nov 2, 2025

Is this PR in a state where you want to move towards merging it?

Made another push restoring some old test cases. Now ready to move forward. Thanks.

Collaborator

@JohannesGaessler JohannesGaessler left a comment

Is the cpy_flt_transpose kernel still needed for a use case that cannot be covered by cpy_flt_coalesced?

@bssrdf
Contributor Author

bssrdf commented Nov 3, 2025

Is the cpy_flt_transpose kernel still needed for a use case that cannot be covered by cpy_flt_coalesced?

@JohannesGaessler, thank you for the review. I'll work through the comments one by one.
The original intent was only to improve the transposed copy, but there are some cases where a permute is composed with a transpose and cpy_flt_transpose failed. Maybe I just need a stricter condition check. cpy_flt_transpose is faster than cpy_flt_coalesced, while cpy_flt_coalesced can handle more general permuted cases.

@bssrdf
Contributor Author

bssrdf commented Nov 3, 2025

I have decided to remove cpy_flt_coalesced and focus solely on the transpose case. The logic in cpy_flt_coalesced is too complicated and not very clean.

A permute is also handled if it is in fact a transpose. The code has been updated with your suggestions.

Current master

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               132 runs -  7841.80 us/run -  1572864 kB/run -  195.63 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               172 runs -  6818.98 us/run -   786432 kB/run -  111.27 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               344 runs -  3060.79 us/run -   786432 kB/run -  247.88 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):                     344 runs -  3059.14 us/run -   786432 kB/run -  248.02 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               132 runs -  7836.65 us/run -  1572864 kB/run -  195.76 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               330 runs -  3153.94 us/run -  1572864 kB/run -  486.40 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               172 runs -  6802.20 us/run -   786432 kB/run -  111.54 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               344 runs -  3057.42 us/run -   786432 kB/run -  248.16 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):                     344 runs -  3056.52 us/run -   786432 kB/run -  248.23 GB/s

This PR

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               462 runs -  2202.66 us/run -  1572864 kB/run -  696.47 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               731 runs -  1400.65 us/run -   786432 kB/run -  541.69 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):              1075 runs -   948.73 us/run -   786432 kB/run -  799.72 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):                    1032 runs -   972.86 us/run -   786432 kB/run -  779.89 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               462 runs -  2190.42 us/run -  1572864 kB/run -  700.37 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               506 runs -  2034.23 us/run -  1572864 kB/run -  754.14 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               688 runs -  1458.58 us/run -   786432 kB/run -  520.18 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):              1075 runs -   950.66 us/run -   786432 kB/run -  798.10 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):                    1032 runs -   972.45 us/run -   786432 kB/run -  780.21 GB/s

@@ -1,3 +1,4 @@
#include <algorithm>
Collaborator

Suggested change
#include <algorithm>

I think this header is no longer needed.

Contributor Author

Yes, removed.

Contributor Author

I also made a few adjustments to reduce bank conflicts for the fp16 and bf16 types. This seems to give a small bump in bandwidth for those types.

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               462 runs -  2247.70 us/run -  1572864 kB/run -  682.52 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               731 runs -  1392.03 us/run -   786432 kB/run -  545.05 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):              1118 runs -   922.83 us/run -   786432 kB/run -  822.17 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):                    1118 runs -   923.18 us/run -   786432 kB/run -  821.86 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               484 runs -  2143.72 us/run -  1572864 kB/run -  715.62 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               528 runs -  1952.04 us/run -  1572864 kB/run -  785.89 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               731 runs -  1381.74 us/run -   786432 kB/run -  549.11 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):              1118 runs -   926.10 us/run -   786432 kB/run -  819.27 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):                    1075 runs -   934.95 us/run -   786432 kB/run -  811.51 GB/s

Contributor Author

I ended up with padding instead of swizzling. Padding is fine here since the shared memory requirement is not tight.
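For context, a quick sketch of the two options (the tile shape and names are illustrative, not the PR's actual code):

```cuda
// Sketch of the two standard fixes for shared-memory bank conflicts in a 32x32
// float transpose tile (illustrative only). Bank = (word address) mod 32, so an
// unpadded [32][32] tile puts an entire column into a single bank.
static __global__ void transpose_tile_padded(const float * __restrict__ src,
                                             float * __restrict__ dst) {
    // (a) padding: one spare column per row; costs 32 extra floats of smem per tile
    __shared__ float tile[32][32 + 1];

    const int r = threadIdx.y;
    const int c = threadIdx.x;

    tile[r][c] = src[r*32 + c];   // coalesced load
    __syncthreads();
    dst[r*32 + c] = tile[c][r];   // column read; conflict-free thanks to the +1

    // (b) swizzling keeps the tile at [32][32] but XORs the column index instead:
    //     write tile[r][c ^ r], read tile[c][r ^ c]; same effect with no extra
    //     shared memory, but every access has to carry the XOR.
}
// launched with dim3 block(32, 32) on a single 32x32 tile
```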
